Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Li Zhang 206305918

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    
options(repos = "https://cloud.r-project.org")

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(duckdb)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:    8.000 GiB 
Freeram:   123.547 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 zhangli  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 zhangli  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 zhangli  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 zhangli  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 zhangli  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 zhangli  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 zhangli  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 zhangli  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 zhangli  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 zhangli  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 zhangli  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 zhangli  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 zhangli  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 zhangli  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 zhangli  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 zhangli  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 zhangli  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 zhangli  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 zhangli  staff    36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 zhangli  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 zhangli  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 zhangli  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 zhangli  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

  • read.csv in base R
system.time({
  admissions_base_r <- read.csv("~/mimic/hosp/admissions.csv.gz")
})
   user  system elapsed 
  3.111   0.057   3.171 
object_size(admissions_base_r)
158.71 MB
  • read_csv in tidyverse
system.time({
  admissions_tidyverse <- read_csv("~/mimic/hosp/admissions.csv.gz")
})
   user  system elapsed 
  0.859   0.087   0.518 
object_size(admissions_tidyverse)
55.31 MB
  • fread in data.table
system.time({
  admissions_data_table <- fread("~/mimic/hosp/admissions.csv.gz")
})
   user  system elapsed 
  0.326   0.034   0.369 
object_size(admissions_data_table)
50.13 MB

The fread function in the data.table package is the fastest.

The read.csv function in base R uses the most memory. The read_csv function in the tidyverse package uses the least memory.

The default parsed data types are different for the three functions.

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

column_types <- cols(
  subject_id = col_double(),
  hadm_id = col_double(),
  admission_type = col_character(),
  admittime = col_datetime(),
  discharge_location = col_character(),
  insurance = col_character(),
  language = col_character(),
  marital_status = col_character(),
  admit_provider_id = col_character(),
  admission_location = col_character(),
  discharge_location = col_character(),
  hospital_expire_flag = col_double(),
  dischtime = col_datetime(),
  deathtime = col_datetime(),
  admission_location = col_character(),
  edregtime = col_datetime(),
  edouttime = col_datetime()
)

system.time({
  admissions_specified <- read_csv("~/mimic/hosp/admissions.csv.gz", col_types = column_types)
})
   user  system elapsed 
  0.791   0.089   0.436 
object_size(admissions_specified)
55.31 MB

The run time is a little shorter.

The result tibble uses the same amount of memory as the result tibble from read_csv without specifying column data types.

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 zhangli  staff  1939088924 Jan  5  2023 /Users/zhangli/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

system.time({
  labevents <- read_csv("~/mimic/hosp/labevents.csv.gz")
})

It takes more than 5 minutes to ingest labevents.csv.gz using read_csv.

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

library(readr)

system.time({
  labevents_data <- read_csv(
    "~/mimic/hosp/labevents.csv.gz", 
    col_select = c(subject_id, itemid, charttime, valuenum))
})

This does not solve the ingestion issue. It still takes more than 5 minutes to ingest labevents.csv.gz using read_csv with selected columns.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

zcat < ~/mimic/hosp/labevents.csv.gz | 
awk -F, 'BEGIN {OFS=","} {if ($5 == 50912 || $5 == 50971 || $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || $5 == 51301 || $5 == 50931) print $2,$5,$7,$10}' | 
gzip > labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | head -10
zcat < labevents_filtered.csv.gz | wc -l
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
 24855909
system.time({
  labevents_filtered <- read_csv("labevents_filtered.csv.gz")
})
   user  system elapsed 
 11.897   3.829   8.670 

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

gzip -d -c ~/mimic/hosp/labevents.csv.gz > /Users/zhangli/203b_hw_new/hw2/labevents.csv
start_time <- Sys.time()
labevents_dataset <- arrow::open_dataset("labevents.csv", format = "csv")
filtered_data <- labevents_dataset %>%
  filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  select(subject_id, itemid, charttime, valuenum)
end_time <- Sys.time()
time_take <- end_time - start_time
cat("Time taken to ingest, select, and filter:", time_take, "s\n")
Time taken to ingest, select, and filter: 0.09190202 s
cat("Number of rows:", nrow(filtered_data), "\n")
Number of rows: 24855909 

The number of rows and the first 10 rows of the result tibble match those in Q2.3.

Apache Arrow helps computers handle really big data faster and more smoothly.

Imagine you have a huge amount of data, much more than your computer’s memory can handle at once. Apache Arrow is like a smart system that helps break down this massive data into smaller, more manageable pieces. These smaller pieces are organized in a way that makes it easy for different programs or systems to work with them efficiently, like having a well-organized library where you can quickly find and borrow the books you need without searching through endless shelves.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

arrow::write_dataset(labevents_dataset, "labevents.parquet")

parquet_size <- file.size("labevents.parquet")
cat("Size of the Parquet file(s):", parquet_size, "bytes\n")
Size of the Parquet file(s): 96 bytes
start_time_parquet <- Sys.time() 
lab_events_parquet <- arrow::open_dataset("labevents.parquet")
filtered_lab_events_parquet <- lab_events_parquet %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum)
end_time_parquet <- Sys.time()
process_time_parquet <- end_time_parquet - start_time_parquet
cat("Time taken to ingest, select, and filter Parquet file(s):", process_time_parquet, "s\n")
Time taken to ingest, select, and filter Parquet file(s): 0.15065 s
num_rows_parquet <- nrow(filtered_lab_events_parquet)
cat("Number of rows in Parquet file(s):", num_rows_parquet, "\n")
Number of rows in Parquet file(s): 24855909 
head(filtered_lab_events_parquet, 10)
FileSystemDataset (query)
subject_id: int64
itemid: int64
charttime: timestamp[ms]
valuenum: double

See $.data for the source Arrow object

The Parquet format is a columnar storage format that is optimized for reading and writing large datasets. It is designed to be efficient for both read and write operations.

Think of Parquet as a clever way to store this data, almost like how a skilled carpenter organizes wood. Parquet breaks down your dataset into smaller, manageable ‘chunks’ and stores them in a highly efficient manner. It’s like having special compartments in a storage room where each compartment holds a portion of your data neatly arranged for easy access. Because of its smart organization, Parquet not only saves space but also makes it lightning-fast to retrieve specific pieces of data when needed.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

# Ingest the Parquet file
start_time_duckdb <- Sys.time()  
lab_events_duckdb <- arrow::open_dataset("labevents.parquet")

# Convert to DuckDB table
lab_events_duckdb_table <- arrow::to_duckdb(lab_events_duckdb)

# Select columns and filter rows based on itemid
filtered_lab_events_duckdb <- lab_events_duckdb_table %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum)

# Time measurement
end_time_duckdb <- Sys.time()
process_time_duckdb <- end_time_duckdb - start_time_duckdb
cat("Time taken to ingest, convert, select, and filter DuckDB table:", process_time_duckdb, "s\n")
Time taken to ingest, convert, select, and filter DuckDB table: 0.3701899 s
# Display the number of rows
filtered_lab_events_df <- as.data.frame(filtered_lab_events_duckdb)
num_rows_df <- nrow(filtered_lab_events_df)
cat("Number of rows in DuckDB table (converted to data frame):", num_rows_df, "\n")
Number of rows in DuckDB table (converted to data frame): 24855909 
# Display the first 10 rows
head(filtered_lab_events_duckdb, 10)
# Source:   SQL [10 x 4]
# Database: DuckDB v0.9.2 [root@Darwin 23.0.0:R 4.3.2/:memory:]
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  

DuckDB is an in-memory analytical database that is designed to be highly efficient for both read and write operations.

It’s designed to be fast and efficient, capable of handling large datasets with ease. Imagine if you have a bunch of files scattered all over your desk, DuckDB helps neatly arrange them into folders and drawers, making it easier for you to find what you need without wasting time.So, in simple terms, DuckDB is a powerful tool that helps computers manage and process data efficiently.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

  • Parquet format
library(data.table)
library(arrow)

data_chartevents <- arrow::open_dataset("~/mimic/icu/chartevents.csv.gz", format = "csv") 
arrow::write_dataset(data_chartevents, "chartevents.parquet")

data_chartevents_parquet <- arrow::open_dataset("chartevents.parquet")
filtered_data_chartevents_parquet <- data_chartevents_parquet %>%
  dplyr::filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) 

print(paste("Number of rows:", 
            nrow(filtered_data_chartevents_parquet)))
[1] "Number of rows: 22502319"
head(filtered_data_chartevents_parquet, 10)
FileSystemDataset (query)
subject_id: int64
hadm_id: int64
stay_id: int64
caregiver_id: int64
charttime: timestamp[ms]
storetime: timestamp[ms]
itemid: int64
value: string
valuenum: double
valueuom: string
warning: int64

See $.data for the source Arrow object